Local modularity measure for network clusterizations 
Many complex networks have an underlying modular structure, i.e., structural subunits (communities
or clusters) characterized by highly interconnected nodes. The modularity Q has been
introduced as a measure to assess the quality of clusterizations. Q has a global view, while in many
real-world networks clusters are linked mainly locally among each other (local cluster-connectivity).
Here, we introduce a new measure, localized modularity LQ, which reflects local cluster structure. 
Optimization of Q and LQ on the clusterization of two biological networks shows that the localized 
modularity identifies more cohesive clusters, yielding a complementary view of higher granularity. 
Complex networks are a powerful tool for the analysis
of a diverse range of systems including technological,
social, and biological networks. Especially in biology,
thanks to high-throughput experiments, there is a
tremendous growth of available data that can be
efficiently analyzed and summarized in terms of complex
networks. In many cases, networks have an inherent
modular structure which can represent functional units,
called communities or clusters, e.g., web pages of a
certain subject, social groups, or biological modules.
However, there is neither an obvious and commonly
accepted definition of communities, nor a
straightforward way to find the underlying modules of a
network. Recently, many clustering algorithms have been
proposed.
For a clusterization with K communities, the modularity
$Q = \sum_{i=1}^{K} \left( e_{ii} - (a_i)_{\mathrm{in}} (a_i)_{\mathrm{out}} \right)$
has been introduced as a measure to assess the quality of
a clusterization [19], where $e_{ii} = L_i / L$, the
effective fraction of links inside community $i$, is
compared to
$(a_i)_{\mathrm{in}} (a_i)_{\mathrm{out}} = (L_i)_{\mathrm{in}} (L_i)_{\mathrm{out}} / L^2$,
which is the predicted fraction of edges that fall into
community i if the links in a directed network are set
between nodes without regard to the community structure.
Q is high when the clusterization is good and it can
reach a maximum value of 1. Modularity is used to compare
the quality of different clusterizations, e.g., to find
the best split of a dendrogram [20] or to validate
different clusterization methods, and furthermore as a
fitness function in optimization procedures, where
$Q_{\max}$ should correspond to the objectively best
clusterization of a network.
The modularity is a global measure because the expected
fraction of links assumes that connections between all
pairs of nodes are equally probable, which reflects
connectivity among all clusters.
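
As a concrete reading of this definition, the following minimal sketch evaluates Q for an undirected, unweighted network. It assumes that every undirected link contributes one link end to each of its endpoints, so that $(a_i)_{\mathrm{in}} = (a_i)_{\mathrm{out}} = d_i / (2L)$, where $d_i$ is the summed degree of cluster $i$ and $L$ the total number of links. The function name `modularity` and the toy graph are illustrative and not taken from the original work.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Q = sum_i [ L_i / L - (d_i / (2 L))^2 ] for an undirected graph.

    edges      : list of (u, v) pairs, one entry per undirected link
    cluster_of : dict mapping each node to its cluster label
    """
    total_links = len(edges)
    inside = defaultdict(int)   # L_i: links with both ends in cluster i
    degree = defaultdict(int)   # d_i: link ends attached to cluster i
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(
        inside[c] / total_links - (degree[c] / (2 * total_links)) ** 2
        for c in degree
    )


# Toy graph (assumption): two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {v: "all" for v in range(6)}

q_two = modularity(edges, two_clusters)
q_one = modularity(edges, one_cluster)
```

For this toy graph the split into the two triangles gives Q = 5/14 (about 0.36), whereas lumping all nodes into one cluster gives Q = 0.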

On the other hand, in many complex networks most
clusters are connected to only a small fraction of the
remaining clusters. In metabolic networks, for instance,
major pathways occur as clusters that are sparsely linked
among each other. Furthermore, in the protein folding
network communities are energy basins and transitions,
i.e., connections, are allowed only between adjacent
basins. We call this property local cluster-connectivity.
In this letter, we introduce a new measure for the
quality of network clusterizations. To take into account
local cluster-connectivity and overcome global network
dependency, the approach of modularity is modified into a
local version. The contribution to modularity for each
cluster i is calculated for the subnetwork consisting of
cluster i and its neighbor clusters. This requires the
determination of i's neighborhood or, more precisely, all
the links $L_{iN}$ that are contained in this
neighborhood. The sum of the contributions of all K
clusters yields the measure LQ, which we call localized
modularity. In contrast to Q, it is not bounded by 1 but
can take any value. The more locally connected clusters a
network has, the higher is LQ. On the other hand, in a
network where all communities are linked among each
other, Q and LQ coincide.
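
The construction described above can be sketched for an undirected, unweighted network as follows. The sketch assumes the form $LQ = \sum_{i=1}^{K} \left[ L_i / L_{iN} - \left( d_i / (2 L_{iN}) \right)^2 \right]$, where $L_i$ counts the links inside cluster $i$, $d_i$ is the summed degree of its nodes, and $L_{iN}$ counts all links whose two ends lie in cluster $i$ or in one of its neighbor clusters. This exact normalization, the function name `localized_modularity`, and the toy graph are assumptions made for illustration.

```python
from collections import defaultdict


def localized_modularity(edges, cluster_of):
    """LQ = sum_i [ L_i / L_iN - (d_i / (2 L_iN))^2 ] (assumed form)."""
    inside = defaultdict(int)     # L_i
    degree = defaultdict(int)     # d_i
    neighbors = defaultdict(set)  # clusters sharing at least one link with i
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            inside[cu] += 1
        else:
            neighbors[cu].add(cv)
            neighbors[cv].add(cu)

    lq = 0.0
    for c in degree:
        hood = neighbors[c] | {c}
        # L_iN: links with both ends inside the neighborhood of cluster c
        links_in_hood = sum(
            1 for u, v in edges
            if cluster_of[u] in hood and cluster_of[v] in hood
        )
        lq += inside[c] / links_in_hood - (degree[c] / (2 * links_in_hood)) ** 2
    return lq


# Toy graph (assumption): a chain of three triangles a - b - c.
chain = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5),
         (6, 7), (7, 8), (6, 8), (2, 3), (5, 6)]
chain_clusters = {v: "abc"[v // 3] for v in range(9)}
lq_chain = localized_modularity(chain, chain_clusters)

# Two triangles joined by one link: every cluster neighbors every other one.
pair = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
pair_clusters = {v: "ab"[v // 3] for v in range(6)}
lq_pair = localized_modularity(pair, pair_clusters)
```

In the chain the two outer triangles are not linked to each other, and LQ (about 0.50) exceeds the value Q of about 0.48 obtained for the same partition with the global normalization. For the pair of triangles the neighborhood of each cluster is the whole network, so LQ reduces to Q = 5/14, in line with the statement that the two measures coincide when all communities are linked among each other.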

It is interesting to compare the behavior of Q and LQ
on different network topologies and use them as fitness
functions for the optimization of network clusterizations
[11, 17]. We start with an illustration of the differences
between Q and LQ by discussing a simple example of a
scalable local cluster-connectivity network, which we call
the school network (Fig. 1A). It is a toy model of social
interactions between pupils in a school with l levels and
c classes per level. Levels have periodic boundary
conditions to avoid spurious boundary effects (in the
first and last level). In a real school, all the students
of a class know each other and, as a first approximation,
a student would interact most with people of his/her age.
In the school network model, students are the nodes of the
network and a link between two pupils is made if they
know each other. Each class contains s fully connected
students. A link between two students of the same level
but different classes is placed with a (high) probability
p < 1 and connections between students that are one
level above/below (±1, Fig. 1A) are made with smaller
probability r < p. No social interaction is assumed
between persons that are more than one level apart from
each other, i.e., if one of the students is more than one
year older than the other (±2 or more, Fig. 1A).
Interestingly, when only two levels and two classes per
level are considered, the school network model is
essentially the same as the well-known (globally
connected) 4 communities test network used in [11, 14].
Hence, the school network is a simple generalization to
locally connected networks. It is unweighted and
undirected but an extension to directed and weighted
networks, e.g., asymmetrical friendship, is
straightforward.

FIG. 1: (A) A student's view in the simplified schematic school network model with only 3 levels, 3 classes/level and 4
students/class: The student interacts with all his classmates, with other students on the same level with probability p = 0.5
and with pupils one level above/below (±1) with probability r = 0.25. No connections are assumed between students that
are more than one level apart (±2 or more). (B) The p-dependent behavior of the modularity and the localized modularity
in the school network with 10 levels, 2 classes per level, 20 pupils per class and r = |. The modularity favors the grouping
of classes (solid line) in the same level for almost all p, whereas localized modularity favors communities consisting of single
classes (dot-dashed line) for p < 0.42.
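
The construction rules of the school network translate directly into a small generator. The sketch below is one possible reading of them: nodes are labeled by hypothetical `(level, class, student)` tuples, the function name `school_network` and the `seed` argument are illustrative, and the periodic boundary is implemented by measuring the level distance around a ring.

```python
import itertools
import random


def school_network(levels, classes, s, p, r, seed=0):
    """Return (nodes, edges) of one random realization of the school network."""
    rng = random.Random(seed)
    nodes = [(lv, cl, st)
             for lv in range(levels)
             for cl in range(classes)
             for st in range(s)]
    edges = []
    for u, v in itertools.combinations(nodes, 2):
        gap = abs(u[0] - v[0])
        gap = min(gap, levels - gap)      # periodic boundary in the levels
        if gap == 0 and u[1] == v[1]:
            prob = 1.0                    # classmates are fully connected
        elif gap == 0:
            prob = p                      # same level, different class
        elif gap == 1:
            prob = r                      # one level above or below
        else:
            prob = 0.0                    # more than one level apart
        if rng.random() < prob:
            edges.append((u, v))
    return nodes, edges


nodes, edges = school_network(levels=4, classes=2, s=5, p=0.5, r=0.1, seed=1)
same_class = [e for e in edges if e[0][:2] == e[1][:2]]
far_apart = [e for e in edges if {e[0][0], e[1][0]} in ({0, 2}, {1, 3})]
```

With 4 levels, 2 classes per level and 5 students per class, every realization has 40 nodes and exactly 80 links inside classes, and no link connects levels that are two steps apart.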

A grouping of all the pupils on one level into the same
cluster is reasonable for high p, i.e., when students of the
same age interact among each other with high probability.
But, as p decreases, classes become more and more
separated from each other until they fully break apart
for p = 0, where a fitness measure is expected to favor
clusterizations that identify classes. Therefore, we
calculated modularity and localized modularity for the
clusterization of nodes according to classes and according
to levels for p ∈ [0, 1], r = | and s = 20 students per class.
Figure 1B shows the Q- and LQ-values for 10 levels and 2
classes per level. They were obtained analytically, using
the expected numbers of links for each p. Both Q and
LQ favor the clusterization into levels for p close to 1.
LQ yields the same value for both clusterizations
(crossing point) at $p_c^{LQ} = 0.42$ and prefers the
clusterization into classes for p < 0.42. The modularity,
on the other hand, has its crossing point at
$p_c^{Q} = 0.09$, i.e., it favors the classes only for
p < 0.09. In other words, Q considers the classes and not
the levels as the best cluster partition only if the
probability of interaction between two students of the
same age but different classes is smaller than 10%.

The crossing point $p_c$ depends on the number of levels
and classes. Figure 2 shows the change of $p_c$ upon
variation of these two parameters with 2, 5 and 10 classes
per level, respectively (from top to bottom). It can be
seen that $p_c^{LQ}$ is higher than $p_c^{Q}$ for all
values of levels and classes, and is by construction
constant for a fixed number of classes per level. On the
other hand, $p_c^{Q}$ strongly depends on network size,
which means that Q favors different clusterizations as the
number of levels increases, i.e., the lens of cluster
detection becomes more coarse. Furthermore, $p_c^{Q}$
converges to 0 as l grows, meaning that Q favors the
clusterization into levels for any p ∈ [0, 1], even though
the classes on the same level are almost disconnected for
small p.
FIG. 2: Dependence of $p_c$ on network size: For 2, 5 and
10 classes/level (from top to bottom), $p_c^{LQ}$ (dotted lines) is
always higher than $p_c^{Q}$ (solid lines) showing that LQ favors
the clusterization into classes for higher p while Q almost
always prefers the grouping into levels. Moreover, $p_c^{Q}$ is rather
sensitive to the size of the network and converges to 0 as the
network grows, while $p_c^{LQ}$ does not depend on the number of
levels.
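
The analytic comparison of the two partitions can be sketched from the expected numbers of links. The code below is an illustration under several assumptions: it uses the undirected forms of Q and LQ with $d_i$ the summed degree of a cluster, it takes the neighborhood of a class or a level to be the level itself plus the two adjacent levels (valid for p > 0, r > 0 and at least 4 levels), and it uses r = 0.2 purely as a placeholder because the value of r is not legible in the text above. The function names are hypothetical, and the resulting numbers are not claimed to reproduce the published curves.

```python
def expected_scores(p, r, levels, classes, s):
    """Expected (Q_class, Q_level, LQ_class, LQ_level) of the school network."""
    within = s * (s - 1) / 2                              # links inside one class
    same_level = classes * (classes - 1) / 2 * s * s * p  # between classes of a level
    adjacent = (classes * s) ** 2 * r                     # between neighboring levels
    level_links = classes * within + same_level           # links inside one level
    total = levels * (level_links + adjacent)             # L
    hood = 3 * level_links + 2 * adjacent                 # L_iN: three stacked levels
    d_level = 2 * level_links + 2 * adjacent              # summed degree of a level
    d_class = d_level / classes                           # summed degree of a class

    def score(n_clusters, links_inside, degree, norm):
        return n_clusters * (links_inside / norm - (degree / (2 * norm)) ** 2)

    return (score(levels * classes, within, d_class, total),
            score(levels, level_links, d_level, total),
            score(levels * classes, within, d_class, hood),
            score(levels, level_links, d_level, hood))


def crossing_point(r, levels, classes, s, localized, tol=1e-9):
    """Bisect for the p at which the level partition starts to score higher."""
    def gain(p):
        q_cls, q_lvl, lq_cls, lq_lvl = expected_scores(p, r, levels, classes, s)
        return (lq_lvl - lq_cls) if localized else (q_lvl - q_cls)

    lo, hi = tol, 1.0
    if gain(lo) >= 0 or gain(hi) <= 0:
        return None                                       # no crossing in (0, 1]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gain(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


R_PLACEHOLDER = 0.2  # assumption, not the value used in the original figures
pc_q_10 = crossing_point(R_PLACEHOLDER, 10, 2, 20, localized=False)
pc_q_50 = crossing_point(R_PLACEHOLDER, 50, 2, 20, localized=False)
pc_lq_10 = crossing_point(R_PLACEHOLDER, 10, 2, 20, localized=True)
pc_lq_50 = crossing_point(R_PLACEHOLDER, 50, 2, 20, localized=True)
```

Under these assumptions the sketch reproduces the qualitative behavior discussed in the text: the crossing point of LQ lies well above that of Q, it does not change when levels are added, and the crossing point of Q shrinks as the network grows.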

These observations indicate that LQ is more reliable
than Q to validate clusterizations in local
cluster-connectivity networks. The discrepancies between the
two measures originate from the fact that Q compares
the effective to the expected fraction of links in the
clusters, no matter if a link is possible or not. The
expected fraction of links is therefore underestimated in
local cluster-connectivity networks, thus the difference
between expected and effective fraction of links (i.e., Q)
is overestimated. On the other hand, LQ only takes into
account local link-expectations. Furthermore, note that
modularity as high as 0.8 has been found in Erdős-Rényi
(ER) random graphs, scale-free networks and regular
lattices [21, 22].

In recent years, biological networks [23] have
attracted the attention of many scientists for their
potential impact on the understanding of living systems.
Metabolic and protein-protein interaction networks have
been clustered by Q optimization and the MCL
method [24], respectively. To investigate the behavior of
Q and LQ on real-world networks we optimized the
clusterizations of two recent realizations of the metabolic
and protein-protein interaction networks of E. coli by
simulated annealing (SA), using each of the two measures as
cost function. For each temperature T, $c_1 n^2$ single-node
and $c_2 n$ multi-node moves, like splitting and merging of
(adjacent) communities, were performed, where $c_{1,2}$ are
constants and n is the number of nodes in the network.
Furthermore, T was iteratively reduced to $c_3 T$ with a
constant $c_3 < 1$. This move set and cooling scheme is
similar to the one used in earlier work. The computational
effort for the two measures scales as O(K), even though the
calculation of LQ is slightly more expensive since it involves
the determination of neighborhoods for each cluster.
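
The annealing loop described above can be illustrated with a deliberately reduced sketch. It is not the procedure used for the results reported here: it performs single-node moves only (no splitting or merging of communities), recomputes the score from scratch after every move instead of updating it incrementally, and uses Q in its undirected form as the fitness function. The names `anneal`, `t_start`, `t_end` and the default constants are assumptions chosen for a small toy graph.

```python
import math
import random
from collections import defaultdict


def modularity(edges, cluster_of):
    """Undirected Q = sum_i [ L_i / L - (d_i / (2 L))^2 ]."""
    total = len(edges)
    inside, degree = defaultdict(int), defaultdict(int)
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)


def anneal(edges, nodes, t_start=0.01, t_end=1e-5, c1=1.0, c3=0.9, seed=0):
    """Maximize Q with single-node moves and geometric cooling T -> c3 * T."""
    rng = random.Random(seed)
    cluster_of = {v: v for v in nodes}          # start from singletons
    score = modularity(edges, cluster_of)
    best, best_score = dict(cluster_of), score
    moves_per_t = int(c1 * len(nodes) ** 2)     # c1 * n^2 moves per temperature
    T = t_start
    while T > t_end:
        for _ in range(moves_per_t):
            v = rng.choice(nodes)
            target = cluster_of[rng.choice(nodes)]   # join a random node's cluster
            old = cluster_of[v]
            if target == old:
                continue
            cluster_of[v] = target
            new_score = modularity(edges, cluster_of)
            # Metropolis rule: always accept improvements, sometimes accept losses
            if new_score >= score or rng.random() < math.exp((new_score - score) / T):
                score = new_score
                if score > best_score:
                    best, best_score = dict(cluster_of), score
            else:
                cluster_of[v] = old              # reject: undo the move
        T *= c3
    return best, best_score


# Toy graph (assumption): two triangles joined by a single link.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
nodes = list(range(6))
start_score = modularity(edges, {v: v for v in nodes})
best, best_score = anneal(edges, nodes)
```

Replacing `modularity` by a function that evaluates LQ would give the corresponding optimization of the localized measure. For networks of realistic size the score should be updated locally after each move rather than recomputed over all links.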

(i) The metabolic network of E. coli. We use the
metabolic pathway database developed by Ma and Zeng
[25], which has been derived from the Kyoto Encyclopedia
of Genes and Genomes (KEGG) [26]. Figure 3A shows
the largest connected component of the E. coli metabolic
network in this database. It contains 563 nodes and 708
links which have been treated as undirected. Each node is
assigned to between zero and nine out of 11 possible
pathways. The optimization with fitness function Q leads to
a division into 16 clusters consisting of 35 metabolites
on average (as colored in Figure 3A) and takes a value
as high as $Q_{\max} = 0.82$. On the other hand, LQ
optimization leads to a maximum of $LQ_{\max} = 12.1$ with 132
clusters, each containing an average of 4.3 metabolites.
The optimization of the two measures finds clusters at a
different level, which yields complementary information.
As expected, Q is based on a global view and depends
on the size of the network. As a consequence, optimizing
a network with more metabolites would lead to larger
Q clusters. This problem is likely to arise because, as
more data become available, the network and its largest
connected component will grow. On the other hand, LQ
finds the lowest-level modules, independent of the rest
of the network. Still, a major motivation to find
clusters is to obtain information about presumed pathways
of non-annotated metabolites. Figure 3B zooms into one
of the Q clusters (white) and shows the splitting into
smaller LQ clusters. The numbers indicate the respective
pathway(s) of the nodes. Note that an LQ cluster
is not necessarily fully contained in a Q cluster, i.e., a
smaller (local) cluster may be only partially contained
in a larger one. In the considered cluster of Figure 3B,
the further division is justified because it results in more
homogeneous subclusters. The yellow community, for
instance, contains mainly nodes belonging to the
carbohydrate metabolism pathway (label 3). According to this,
the unassigned node (N-Acetyl-alpha-D-glucosamine
1-phosphate, labeled as "?" in Fig. 3B) can also be
classified in pathway 3 with a high confidence. This would
have been impossible when considering the white cluster
obtained by Q, whose nodes are assigned mainly to
pathway 6 (Glycan biosynthesis and metabolism) and 1
(Amino-acid metabolism).
FIG. 3: (Color online) (A) Largest connected component
of the metabolic network of E. coli. The coloring scheme
represents the clusterization found by optimizing
modularity. Some colors are used twice. (B) LQ
clusterization of the white Q cluster with the annotation of
different pathways. According to LQ it is highly probable that
the unassigned yellow node (N-Acetyl-alpha-D-glucosamine
1-phosphate, marked as "?") belongs to the carbohydrate
metabolism (label 3).

To obtain a more quantitative analysis, we compute
the conditional probability

$P[i,j] = P\left[ \pi(i) \cap \pi(j) \neq \emptyset \mid c(i) = c(j) \right]$

that two nodes i and j, lying in the same cluster c, share
at least one pathway ($\pi$). For the Q clusterization, this
probability is $P_Q[i,j] = 0.57$, while $P_{LQ}[i,j] = 0.73$,
reflecting the higher homogeneity of the LQ clusters.
Comparison to the null-case, where nodes are picked at
random from the network, yields $P_0[i,j] = 0.26$ and the
probability that any pair of linked nodes shares a pathway
is 0.59, thus essentially the same as for the clustering
with Q.
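
The pathway-sharing probability can be estimated by counting pairs, as in the following minimal sketch. It assumes that pathway annotations are available as sets and that a node without any annotation never counts as sharing a pathway; how unannotated nodes were actually treated is not stated in the text. The function name `shared_pathway_probability` and the toy data are illustrative.

```python
from itertools import combinations


def shared_pathway_probability(cluster_of, pathways):
    """Fraction of same-cluster node pairs that share at least one pathway.

    cluster_of : dict node -> cluster label, c(i)
    pathways   : dict node -> set of pathway labels, pi(i)
    """
    same_cluster = 0
    sharing = 0
    for i, j in combinations(cluster_of, 2):
        if cluster_of[i] == cluster_of[j]:
            same_cluster += 1
            if pathways[i] & pathways[j]:   # non-empty intersection
                sharing += 1
    return sharing / same_cluster if same_cluster else float("nan")


# Toy annotation (assumption): five nodes in two clusters.
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
pathways = {"a": {3}, "b": {3, 6}, "c": {6}, "d": {1}, "e": set()}
prob = shared_pathway_probability(cluster_of, pathways)
```

In the toy data four pairs lie in the same cluster and two of them share a pathway, so the estimate is 0.5. The null-case corresponds to calling the same function with all nodes assigned to a single cluster.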

(ii) The protein-protein interaction (PPI) network of
E. coli. A set of 716 verified interactions involving 270
proteins of E. coli has been reported [27]. We again
focused on the largest connected component consisting of
230 proteins and 695 undirected connections (Figure 4A).
Identifying clusters can help to find indications about
the function of unknown proteins. Again, modularity
and localized modularity differ in the granularity of the
clusters, similar to using two different lenses of a
microscope. While the highest value for Q has been found for
a clusterization with 7 communities ($Q_{\max} = 0.49$), LQ
splits the network into 56 communities ($LQ_{\max} = 2.97$).
An example where LQ yields a more accurate "guess" is
given in Figure 4B, where the LQ clusterization further
subdivides the black cluster of Figure 4A. The proteins
in the green circle are part of the DNA polymerase
complex (dnaE, dnaQ, dnaX, dnaQ, holA, holB, holC, holD
and holE). According to LQ, the unknown protein b1808
appears to be a protein of this complex. On the other
hand, the black cluster obtained by Q is more
heterogeneous, which makes a functional assignment of b1808
difficult.




In conclusion, a new measure for the quality of
network clusterizations, called localized modularity, has
been introduced and compared to the widely used
modularity. Both measures can be used essentially in the
same way. The latter has been applied previously by
others to assess the clusterization quality in many
networks and has been used to find the best split of a
dendrogram and as fitness function in optimization
algorithms. Finding clusters by optimizing a given fitness
function has the advantage of not using any parameters
(unlike many other clustering methods [15, 16, 18]). Q
depends on global properties like the network size and
the cluster-connectivity. However, in many real-world
networks, communities are merely connected locally, i.e.,
most pairs of clusters are not linked. We have called such
organization local cluster-connectivity. By detailed
investigation of model networks as well as the optimization of
Q and LQ on two biological networks, we have provided
evidence that the two measures give a view of different
depth into the cluster structure. In contrast to Q, LQ
takes into account individual clusters and their nearest
neighbors, generating high-confidence clusters,
irrespective of the rest of the network. Thus, the two measures
provide complementary information. Furthermore, the
LQ approach can be generalized to 2nd or higher nearest
neighbors which, albeit computationally more expensive,
might yield additional insights, as if one were to use
different lenses of a microscope.

This work was supported by a grant from the Swiss Na- 
tional Science Foundation. 



FIG. 4: (Color online) (A) Largest connected component
of the PPI of E. coli. The colors represent the clusterization
found by optimizing modularity. (B) LQ clusterization of the
black Q cluster. The green circle contains proteins
belonging to the DNA polymerase complex. The unknown protein
b1808 is assigned to this complex according to LQ while the
complete Q cluster is heterogeneous.